The company processes hundreds of thousands of customer documents that must be registered on our platform. To ease our operational load, we need a solution that automates both the upload of these documents and their analysis.
!pip install boto3
!pip install thefuzz
!pip install nbconvert
!pip install pyppeteer
We use the following libraries: boto3, googlemaps, pandas, plotly.express, os, and thefuzz (fuzz).
import boto3
import googlemaps
import pandas as pd
import plotly.express as px
import os
from thefuzz import fuzz, process
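We set our configuration values; the secrets are read from environment variables
# Credentials are read from the environment rather than hard-coded in the notebook.
# The variable name GMAPS_KEY is an assumption; use whatever name your environment defines.
AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']
BUCKET_NAME = 'groupbtest'
FOLDER_PATH = './Untitled Folder'
REMOTE_FILE_NAME = 'Prueba.txt'
GMAPS_KEY = os.environ['GMAPS_KEY']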
df = pd.DataFrame(columns=['Direccion'])
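Since this cell is about environment variables, one possible way to load them is sketched below. The helper name `require_env` is hypothetical and not part of the original notebook; it only wraps `os.environ` so that a missing variable fails with a clear message before any AWS or Google call is made.

```python
import os


def require_env(name):
    """Return the value of environment variable `name` or raise a clear error.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (assumes the variables were exported before starting Jupyter):
# AWS_ACCESS_KEY_ID = require_env("AWS_ACCESS_KEY_ID")
# GMAPS_KEY = require_env("GMAPS_KEY")
```

Keeping secrets out of the notebook also keeps them out of any exported HTML or PDF copy.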
Connection to AWS S3
s3 = boto3.client('s3', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
We list all the files in our root folder and then upload them to the connected S3 bucket
# NoCredentialsError lives in botocore and must be imported before it can be caught
from botocore.exceptions import NoCredentialsError
file_names = os.listdir(FOLDER_PATH)
for file_name in file_names:
    try:
        # Upload each local file under the same name (key) in the bucket
        s3.upload_file(f'{FOLDER_PATH}/{file_name}', BUCKET_NAME, file_name)
        print(f'{file_name} was uploaded successfully to {BUCKET_NAME} as {file_name}')
    except FileNotFoundError:
        print(f'The file {FOLDER_PATH}/{file_name} was not found')
    except NoCredentialsError:
        print('AWS credentials were not found')
Dir1.txt was uploaded successfully to groupbtest as Dir1.txt
Dir2.txt was uploaded successfully to groupbtest as Dir2.txt
Dir3.txt was uploaded successfully to groupbtest as Dir3.txt
Dir4.txt was uploaded successfully to groupbtest as Dir4.txt
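`os.listdir` also returns subdirectories, which `upload_file` is not designed to handle. A minimal sketch of filtering the listing down to regular files follows; `list_uploadable_files` is a hypothetical helper name, and the folder is assumed to be the one referenced by `FOLDER_PATH`.

```python
from pathlib import Path


def list_uploadable_files(folder):
    """Return the sorted names of the regular files directly inside `folder`.

    Hypothetical helper: subdirectories are skipped because they cannot be
    uploaded as a single S3 object.
    """
    return sorted(p.name for p in Path(folder).iterdir() if p.is_file())
```

Each returned name could then be passed to `s3.upload_file` in the same way as in the upload loop.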
We create a function that generates the alternate spellings of the addresses found in our flat files
def direccionesAlternas(direccion):
    # direccion is the address split into tokens: indexes 1 and 3 hold the two
    # street numbers; the last number is at index 5 when index 4 is a '-'
    # separator, otherwise it is at index 4
    first, second = direccion[1], direccion[3]
    last = direccion[5] if direccion[4] == '-' else direccion[4]
    direcciones = ['Carrera ' + first + ' # ' + second + ' ' + last]
    variants = [('Carrera', 'Nro'), ('Carrera', 'Numero'), ('Carrera', 'Num'),
                ('Kra', 'Num'), ('Calle', 'Num'), ('Transversal', 'Num')]
    for street_type, label in variants:
        direcciones.append(street_type + ' ' + first + ' ' + label + ' ' + second + ' - ' + last)
    return direcciones
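The function above indexes the token list directly, so an address with fewer tokens than expected raises `IndexError`. As a sketch, assuming the input files contain addresses shaped like `Carrera 70 # 26A - 33` or `Carrera 70 # 26A 33` (the exact file format is not shown in this notebook), a hypothetical helper `split_address` could validate the tokens first:

```python
def split_address(text):
    """Split an address into its three numbers, or return None if malformed.

    Hypothetical helper. Assumes the layout '<type> <n1> # <n2> [-] <n3>'.
    Uses str.split() so repeated spaces are tolerated.
    """
    tokens = text.split()
    if len(tokens) >= 6 and tokens[4] == '-':
        return tokens[1], tokens[3], tokens[5]
    if len(tokens) >= 5 and tokens[4] != '-':
        return tokens[1], tokens[3], tokens[4]
    return None


print(split_address('Carrera 70 # 26A - 33'))  # ('70', '26A', '33')
print(split_address('Carrera 70 # 26A 33'))    # ('70', '26A', '33')
print(split_address('Carrera 70'))             # None
```

Files that return `None` could be logged and skipped instead of stopping the run.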
Function that computes the similarity percentage with the fuzz library, using fuzzy string matching
def calcularPorcentaje(diroriginal):
    direccion = diroriginal
    direcciones = direccionesAlternas(direccion)
    # Rebuild the original address as one string to compare against each variant
    temporal = " ".join(diroriginal)
    for direccionFictica in direcciones:
        ratio = fuzz.partial_ratio(temporal.lower(), direccionFictica.lower())
        # Keep only the variants that are at least 90% similar to the original
        if ratio >= 90:
            df.loc[len(df)] = direccionFictica.strip()
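`fuzz.partial_ratio` scores, on a scale from 0 to 100, how well the shorter string matches the best-aligned substring of the longer one. The sketch below approximates that idea with the standard library's `difflib`; it is an illustration, not thefuzz's actual implementation, and its scores may differ from the library's. The name `partial_ratio_sketch` is hypothetical.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a, b):
    """Approximate a partial-match score (0-100) between two strings."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    matcher = SequenceMatcher(None, shorter, longer)
    best = 0.0
    # Each matching block suggests one alignment of `shorter` inside `longer`
    for block in matcher.get_matching_blocks():
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# A string that appears verbatim inside a longer one scores 100
print(partial_ratio_sketch('carrera 70', 'carrera 70 # 26a 33'))  # 100
```

Because the original address is compared with every generated variant, only the variants that stay close to the original wording reach the threshold of 90.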
We list the objects in the previously created bucket, read each file and extract its content, generate the alternate addresses with direccionesAlternas, and then compute the similarity percentage; every variant scoring 90 or higher is added to the DataFrame.
response = s3.list_objects_v2(Bucket=BUCKET_NAME)
for obj in response.get('Contents', []):
    try:
        # Download the object and decode its text content
        obj_response = s3.get_object(Bucket=BUCKET_NAME, Key=obj['Key'])
        data = obj_response['Body'].read()
        data_str = data.decode('utf-8').strip()
        # Split the address into tokens and keep the variants scoring >= 90
        diroriginal = data_str.split(' ')
        calcularPorcentaje(diroriginal)
    except Exception as e:
        print(f"An error occurred: {e}")
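A single `list_objects_v2` call returns at most 1,000 keys, so a bucket holding hundreds of thousands of documents needs pagination. A minimal sketch using boto3's paginator interface follows; `iter_bucket_keys` is a hypothetical helper that receives an already-created S3 client, so the block itself does not import boto3.

```python
def iter_bucket_keys(s3_client, bucket_name):
    """Yield every object key in the bucket, one page at a time.

    Hypothetical helper; `s3_client` is assumed to be a boto3 S3 client
    such as the `s3` object created earlier in this notebook.
    """
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name):
        # Pages of an empty bucket carry no 'Contents' entry
        for obj in page.get('Contents', []):
            yield obj['Key']
```

The loop over `response.get('Contents', [])` could then iterate over `iter_bucket_keys(s3, BUCKET_NAME)` and pass each key to `s3.get_object`.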
We print the DataFrame
print(df)
             Direccion
0  Carrera 70 # 26A 33
1  Carrera 70 # 26A 80
2   Carrera 70 # 70 88
3   Carrera 70 # 86 33
We initialize the Google Maps client, gmaps_key, with the API key
gmaps_key = googlemaps.Client(key = GMAPS_KEY)
We add the columns LAT, LON, Color, and Tamaño to our DataFrame. LAT and LON hold the values the API returns for each address; Color and Tamaño are used only for drawing the map.
df['LAT'] = None
df['LON'] = None
df['Color'] = 1
df['Tamaño'] = 1
for i in range(len(df)):
    # Geocode the address stored in the first column (Direccion)
    geocode_result = gmaps_key.geocode(df.iat[i, 0])
    try:
        lat = geocode_result[0]["geometry"]["location"]["lat"]
        lon = geocode_result[0]["geometry"]["location"]["lng"]
        df.iat[i, df.columns.get_loc("LAT")] = lat
        df.iat[i, df.columns.get_loc("LON")] = lon
    except (IndexError, KeyError):
        # No usable result for this address: LAT and LON stay as None
        continue
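The geocoding response is a list that may be empty when the API finds no match. As a sketch, the coordinate lookup can be isolated in a small function; `extract_lat_lng` is a hypothetical name, and the response shape is taken from the indexing used in the loop above.

```python
def extract_lat_lng(geocode_result):
    """Return (lat, lng) from a geocoding response, or (None, None) if absent."""
    try:
        location = geocode_result[0]['geometry']['location']
        return location['lat'], location['lng']
    except (IndexError, KeyError, TypeError):
        return None, None


sample = [{'geometry': {'location': {'lat': 6.228612, 'lng': -75.591616}}}]
print(extract_lat_lng(sample))  # (6.228612, -75.591616)
print(extract_lat_lng([]))      # (None, None)
```

Isolating the lookup makes it possible to test the parsing without calling the API.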
We view the final version of the DataFrame
df
| | Direccion | LAT | LON | Color | Tamaño |
|---|---|---|---|---|---|
| 0 | Carrera 70 # 26A 33 | 6.228612 | -75.591616 | 1 | 1 |
| 1 | Carrera 70 # 26A 80 | 6.228965 | -75.591167 | 1 | 1 |
| 2 | Carrera 70 # 70 88 | 6.216905 | -75.592414 | 1 | 1 |
| 3 | Carrera 70 # 86 33 | 6.23553 | -75.591561 | 1 | 1 |
We draw the map with plotly, passing it the DataFrame and its column values
fig = px.scatter_mapbox(df, lon='LON', lat='LAT', zoom=10, color='Color', size='Tamaño', width=900, height=600, title='DIRECTIONS MAP')
fig.update_layout(mapbox_style = "open-street-map")
fig.update_layout(margin = {"r":0,"t":50,"l":0,"b":10})
fig.show()